20/02/2020

Big Data?

Posted on flickr by BBVAtech in 2012; photo by Asigra, [CC BY 2.0](https://creativecommons.org/licenses/by/2.0/)

Expert Survey (UC Berkeley, 2014)

  • Ask 40 experts to define “big data”
  • … get 40 different definitions :)

Expert Survey: Example 1

“Big Data is the result of collecting information at its most granular level — it’s what you get when you instrument a system and keep all of the data that your instrumentation is able to gather.”

Jon Bruner (Editor-at-Large, O’Reilly Media)

Expert Survey: Example 2

“Big data is data that contains enough observations to demand unusual handling because of its sheer size, though what is unusual changes over time and varies from one discipline to another.”

Annette Greiner
(Lecturer, UC Berkeley School of Information)

Expert Survey: Example 3

“[…] ‘big data’ will ultimately describe any dataset large enough to necessitate high-level programming skill and statistically defensible methodologies in order to transform the data asset into something of value.”

Reid Bryant
(Data Scientist, Brooks Bell)

Conclusion

  • Large amounts of data
  • Various types/formats of data
  • Unusual sources
  • Speed of data flow/stream
  • Use programming and statistics (in a broad sense) to extract value

‘Learn Big Data’?

Domains Affected

  • How to design/set up the machinery to handle large amounts of data? (Hardware focus, data engineering)
  • How to use the existing machinery most efficiently for large amounts of data?
  • How to approach the analysis of large amounts of data with econometrics?

Focus in This Course

  • How to design/set up the machinery to handle large amounts of data? (Hardware focus, data engineering)
  • How to use the existing machinery most efficiently for large amounts of data?
  • How to approach the analysis of large amounts of data with econometrics?
    1. Compute ‘usual’ statistics based on large dataset (many observations).
    2. Practical handling of large data sets for applied econometrics (gathering, storage, preparation, etc.)

Big Data in Scientific Research

Big Data in the Sciences

  • Mother Nature has always provided the data, but…
    • … instruments have gotten more precise
    • … new measurement methods have been developed
  • Prominent examples: Astronomy, Genomics/Bioinformatics
Photo by Joe Parks, [(CC BY-NC 2.0)](https://creativecommons.org/licenses/by-nc/2.0/) source: https://flic.kr/p/e2umhv

Big Data in the Social Sciences

  • Hardware: Diffusion of the Internet and mobile-phone networks.
  • Software: Web 2.0 Technologies (APIs, JSON, Programmable Web, etc.).
    • Backbone of social media and many prominent web services (e.g., Google Maps).
    • Data integration across platforms and services.
    • Exchange of data between/across applications.

Big Data in the Social Sciences/Economics

Source: Bollen, Mao, and Zeng (2011)

Big Data in the Social Sciences/Economics

Source: Ranco et al. (2015)

Big Data in the Social Sciences/Economics

  • Often tied to web applications and digitization of economic and political processes.
  • Volume of data is substantial (but usually smaller than in the natural sciences).
  • Variety and variability often more challenging than in natural sciences.
    • Various sources
    • Data generation/sensors are independent from research endeavor.
  • Questions/problems often similar to applied research in the industry.
    • Key difference: usually no streaming applications (velocity is not that much of an issue).

This Course

Three Parts

  1. Big Data: Basic Concepts
  2. Local Big Data Analytics
  3. Advanced Topics

Objectives

  • Understand the concept of Big Data in the context of economic research.
  • Understand the technical challenges of Big Data Analytics and how to practically deal with them.
  • Know how to apply the relevant R packages and programming practices to effectively and efficiently handle large data sets.

Schedule

  1. Introduction: Big Data, Data Economy. Walkowiak (2016): Chapter 1.
  2. Computation and Memory in Applied Econometrics.
  3. Advanced R Programming. Wickham (2019): Chapters 2, 3, 17, 23, 24.
  4. Import, Cleaning and Transformation of Big Data. Walkowiak (2016): Chapter 3: p. 74‑118.
  5. Aggregation and Visualization. Walkowiak (2016): Chapter 3: p. 118‑127; Wickham et al. (2015); Schwabish (2014).
  6. Data Storage, Databases Interaction with R. Walkowiak (2016): Chapter 5.
  7. Cloud Computing: Introduction/Overview.
  8. Distributed Systems, Hadoop with R. Walkowiak (2016): Chapter 4.
  9. Applied Econometrics with Spark; Machine Learning and GPUs.
  10. Project Presentations (7 May, 2020; 08:15-10:00; Room 23-103).
  11. Project Presentations; Q&A.


Examination: Part I

  • Decentral: group examination ‘paper’ (all students in a group receive the same grade) (60%).
  • Group size: 3 (or 2) students.
  • Take‐home exercises: Application of basic concepts in R when working with big data. Conceptual questions related to the application.
Hand-in: June 8, 2020, 16:00.
More details next week.

Examination: Part II

  • Decentral: group examination, presentation + code (all students in a group receive the same grade) (40%).
  • Big data analytics group projects: Own approach/strategy, implemented in R, presentation of results in class.
7 May 2020, 08:15-10:00
(14 May 2020, 08:15-10:00)
More details next week

Approach

Prerequisites?

  • Basic R programming skills.
  • Build on concepts taught in Data Analytics I (and more basic econometrics courses).
    • Brief review of concepts, but no additional introduction.

R used in two ways

  • A tool to analyze problems posed by large datasets.
    • For example, memory usage (in R).
    • (The idea behind the ‘Advanced R Programming’ part.)
  • A practical tool for Big Data Analytics.

Example

Preparations

# read dataset into R
economics <- read.csv("../data/economics.csv")
# have a look at the data
head(economics, 2)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
# create a 'large' dataset out of this
for (i in 1:3) {
     economics <- rbind(economics, economics)
}
dim(economics)
## [1] 4592    6

Example

Compute the real personal consumption expenditures (pce): Divide each value of pce by the deflator 1.05.

# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- c()
n_obs <- length(economics$pce)
for (i in 1:n_obs) {
  pce_real <- c(pce_real, economics$pce[i]/deflator)
}

# look at the result
head(pce_real, 2)
## [1] 483.2381 486.1905

Example

How long does it take?

# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- list()
n_obs <- length(economics$pce)
time_elapsed <-
     system.time(
         for (i in 1:n_obs) {
              pce_real <- c(pce_real, economics$pce[i]/deflator)
})

time_elapsed
##    user  system elapsed 
##   0.119   0.016   0.136

Example

Assuming a linear-time algorithm (\(O(n)\)), the time needed per additional row of data is:

time_per_row <- time_elapsed[3]/n_obs
time_per_row
##      elapsed 
## 2.961672e-05

Example

If we deal with big data, say 100 million rows, that is

# in seconds
(time_per_row*100^4) 
##  elapsed 
## 2961.672
# in minutes
(time_per_row*100^4)/60 
##  elapsed 
## 49.36121
# in hours
(time_per_row*100^4)/60^2 
##   elapsed 
## 0.8226868

Example

What happens in the background?

  • Evaluation/computation
  • Memory allocation/deallocation
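
A minimal sketch of the memory side (the helper names `grow` and `prealloc` are ours, for illustration): growing a result with `c()` allocates a fresh, larger vector and copies the old contents on every iteration, while pre-allocation reserves the memory once.

```r
# Growing an object: R allocates a new, larger vector each iteration
grow <- function(n) {
     out <- c()
     for (i in 1:n) out <- c(out, i)  # copy + reallocate every time
     out
}

# Pre-allocating: one allocation up front, then in-place writes
prealloc <- function(n) {
     out <- numeric(n)
     for (i in 1:n) out[i] <- i
     out
}

# both yield the same result; only the allocation pattern differs
stopifnot(identical(grow(1000), prealloc(1000)))
```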

Example

Can we improve this?

# Improve memory allocation (still somewhat ignorant of R)
deflator <- 1.05 # define deflator
n_obs <- length(economics$pce)
pce_real <- list()
# allocate memory beforehand
# tell R how long the list will be
length(pce_real) <- n_obs
# iterate through each observation
time_elapsed <-
     system.time(
         for (i in 1:n_obs) {
              pce_real[[i]] <- economics$pce[i]/deflator
})

time_elapsed
##    user  system elapsed 
##   0.008   0.000   0.008

Example

Any improvements?

time_per_row <- time_elapsed[3]/n_obs
time_per_row
##     elapsed 
## 1.74216e-06

Example

# in seconds
(time_per_row*100^4) 
## elapsed 
## 174.216
# in minutes
(time_per_row*100^4)/60 
## elapsed 
##  2.9036
# in hours
(time_per_row*100^4)/60^2 
##    elapsed 
## 0.04839334

This looks much better, but we can do even better…

Example

Can we further improve this?

# Do it 'the R way'
deflator <- 1.05 # define deflator
# Exploit R's vectorization!
time_elapsed <- 
     system.time(
     pce_real <- economics$pce/deflator
          )
# same result
head(pce_real, 2)
## [1] 483.2381 486.1905
# but much faster!
time_elapsed
##    user  system elapsed 
##       0       0       0
time_per_row <- time_elapsed[3]/n_obs

Example

In fact, system.time() is not precise enough to capture the time elapsed…

# in seconds
(time_per_row*100^4) 
## elapsed 
##       0
# in minutes
(time_per_row*100^4)/60 
## elapsed 
##       0
# in hours
(time_per_row*100^4)/60^2 
## elapsed 
##       0

Example

Use microbenchmark::microbenchmark() to measure the elapsed time in microseconds (millionths of a second).

library(microbenchmark)
# measure elapsed time in microseconds (avg.)
time_elapsed <- 
  summary(microbenchmark(pce_real <- economics$pce/deflator))$mean

# per row (in sec)
time_per_row <- (time_elapsed/n_obs)/10^6

Example

Improvement with vectorization (again, assuming 100 million rows)

# in seconds
(time_per_row*100^4) 
## [1] 0.4982415
# in minutes
(time_per_row*100^4)/60 
## [1] 0.008304025
# in hours
(time_per_row*100^4)/60^2 
## [1] 0.0001384004

What do we learn from this?

  1. How R allocates and deallocates memory can have a substantial effect on computation time.
    • (Particularly, if we deal with a large dataset!)
  2. How the computation is implemented can matter a lot for the elapsed time.
    • (For example, loops vs. vectorization/apply)
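
To illustrate the second point, a small sketch comparing the three idioms on the task from the example (the data here is a made-up stand-in for `economics$pce`); on large inputs the vectorized version is typically fastest, because the element-wise loop runs in compiled code.

```r
deflator <- 1.05
pce <- runif(10000, 100, 1000)  # hypothetical stand-in for economics$pce

# 1. loop with a pre-allocated result vector
res_loop <- numeric(length(pce))
for (i in seq_along(pce)) res_loop[i] <- pce[i] / deflator

# 2. apply-style
res_sapply <- sapply(pce, function(x) x / deflator)

# 3. vectorized: one call, element-wise division in C
res_vec <- pce / deflator

# all three compute the same quantity
stopifnot(all.equal(res_loop, res_vec), all.equal(res_sapply, res_vec))
```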

Course Resources

Literature

Notes, Slides, Code, et al.

Suggested Learning Procedure

  • Clone/fork the course’s GitHub repository.
  • During class, use the Rmd file of the slide set as the basis for your notes.
  • After class, enrich/merge/extend your notes with the lecture notes.

TODO (for next week!)

Q&A

  • General questions about the course?
  • Exchange students: additional information regarding prerequisites

References

Bollen, Johan, Huina Mao, and Xiaojun Zeng. 2011. “Twitter Mood Predicts the Stock Market.” Journal of Computational Science 2 (1): 1–8. https://doi.org/10.1016/j.jocs.2010.12.007.

Ranco, Gabriele, Darko Aleksovski, Guido Caldarelli, Miha Grčar, and Igor Mozetič. 2015. “The Effects of Twitter Sentiment on Stock Price Returns.” PLOS ONE 10 (9): 1–21. Public Library of Science. https://doi.org/10.1371/journal.pone.0138441.